Dec 6th

Questions 8, 13 (13g optional), 15(a)

Q1

Describe the null hypotheses to which the p-values given in Table 3.4 correspond. Explain what conclusions you can draw based on these p-values. Your explanation should be phrased in terms of sales, TV, radio, and newspaper, rather than in terms of the coefficients of the linear model.

  • Intercept: the null hypothesis is that there are no sales when there is no advertising. The p-value is significant, indicating that there are some sales without advertising.
  • __TV:__ the null hypothesis is that changing TV advertising while holding radio and newspaper constant does not change sales. The significant p-value indicates that changing TV advertising does affect sales when holding the other predictors constant.
  • radio: the null hypothesis is that changing radio advertising while holding TV and newspaper constant does not change sales. The significant p-value indicates that changing radio advertising does affect sales when holding the other predictors constant.
  • newspaper: the null hypothesis is that changing newspaper advertising while holding radio and TV constant does not change sales. The insignificant p-value indicates that changing newspaper advertising does not affect sales when holding the other predictors constant.

Q2

Carefully explain the differences between the KNN classifier and KNN regression methods.

KNN classifier is when we are trying to predict which group an observation belongs to. The proportion of K nearest neighbors belonging to each group is used for this classification. KNN regression is used when we are tyring to predict a numeric value given predictors. Here the average value for K nearest neighbors, given the predictors, is used.

Q3

Suppose we have a data set with five predictors, X1 = GPA, X2 = IQ, X3 = Gender (1 for Female and 0 for Male), X4 = Interaction between GPA and IQ, and X5 = Interaction between GPA and Gender. The response is starting salary after graduation (in thousands of dollars). Suppose we use least squares to fit the model, and get βˆ0 = 50, βˆ1 = 20 , βˆ2 = 0.07 , βˆ3 = 35 , βˆ4 = 0.01 , βˆ5 = −10 .

(a) Which answer is correct, and why? i. For a fixed value of IQ and GPA, males earn more on average than females. ii. For a fixed value of IQ and GPA, females earn more on average than males. iii. For a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough. iv. For a fixed value of IQ and GPA, females earn more on average than males provided that the GPA is high enough.

The correct answer is iii. Because \(\hat \beta _5\) is negative the postive effect of female gender (\(\hat \beta _3\)) is offset at higher GPAs

(b) Predict the salary of a female with IQ of 110 and a GPA of 4.0.

50 + 20*4 + 110 * 0.07 + 1*35 + 4*110*0.01 + 1*4*-10
## [1] 137.1

(c) True or false: Since the coefficient for the GPA/IQ interaction term is very small, there is very little evidence of an interaction effect. Justify your answer.

FALSE. We need to look at the p-values. also the size of the coefficient is dependent on the scale of the predictor, not signficance per se.

Q8

8. This question involves the use of simple linear regression on the Auto data set.

library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 3.4.2
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.4.2
## Warning: package 'purrr' was built under R version 3.4.2
## Warning: package 'dplyr' was built under R version 3.4.2
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
## select(): dplyr, MASS
data(Auto)
auto <- as.tibble(Auto)
auto
## # A tibble: 392 x 9
##      mpg cylinders displacement horsepower weight acceleration  year
##  * <dbl>     <dbl>        <dbl>      <dbl>  <dbl>        <dbl> <dbl>
##  1    18         8          307        130   3504         12.0    70
##  2    15         8          350        165   3693         11.5    70
##  3    18         8          318        150   3436         11.0    70
##  4    16         8          304        150   3433         12.0    70
##  5    17         8          302        140   3449         10.5    70
##  6    15         8          429        198   4341         10.0    70
##  7    14         8          454        220   4354          9.0    70
##  8    14         8          440        215   4312          8.5    70
##  9    14         8          455        225   4425         10.0    70
## 10    15         8          390        190   3850          8.5    70
## # ... with 382 more rows, and 2 more variables: origin <dbl>, name <fctr>

(a) Use the lm() function to perform a simple linear regression with mpg as the response and horsepower as the predictor. Use the summary() function to print the results. Comment on the output. For example:

lm1 <- lm(mpg ~ horsepower, data = auto)

summary(lm1)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictor and the response?

Yes, there is a highly significant (negative) relationship between horsepower and mpg

ii. How strong is the relationship between the predictor and the response?

given very low p-value and very non-zero t-value, strong

iii. Is the relationship between the predictor and the response positive or negative?

negative

iv. What is the predicted mpg associated with a horsepower of 98? What are the associated 95 % confidence and prediction intervals?

cat("predition at horsepower 98 and 95% confidence intervals:\n")
## predition at horsepower 98 and 95% confidence intervals:
predict(lm1,data.frame(horsepower=98),interval = "conf")
##        fit      lwr      upr
## 1 24.46708 23.97308 24.96108
cat("\npredition at horsepower 98 and 95% prediction intervals:\n")
## 
## predition at horsepower 98 and 95% prediction intervals:
predict(lm1,data.frame(horsepower=98),interval = "pred")
##        fit     lwr      upr
## 1 24.46708 14.8094 34.12476

(b) Plot the response and the predictor. Use the abline() function to display the least squares regression line.

auto %>% 
  ggplot(aes(x=horsepower, y=mpg)) +
           geom_point() +
           geom_smooth(method = "lm", se=FALSE)

(c) Use the plot() function to produce diagnostic plots of the least squares regression fit. Comment on any problems you see with the fit.

library(modelr)
auto %>% 
  add_residuals(lm1) %>%
  ggplot(aes(x=horsepower, y = resid)) +
  geom_point() +
  geom_hline(yintercept=0)

Residuals not evenly distributed!

plot(lm1,ask = FALSE)

## Q9. This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

data(Auto)
auto <- as_tibble(Auto)
pairs(auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto %>% select(-name) %>% cor() %>% knitr::kable(digits = 3)
mpg cylinders displacement horsepower weight acceleration year origin
mpg 1.000 -0.778 -0.805 -0.778 -0.832 0.423 0.581 0.565
cylinders -0.778 1.000 0.951 0.843 0.898 -0.505 -0.346 -0.569
displacement -0.805 0.951 1.000 0.897 0.933 -0.544 -0.370 -0.615
horsepower -0.778 0.843 0.897 1.000 0.865 -0.689 -0.416 -0.455
weight -0.832 0.898 0.933 0.865 1.000 -0.417 -0.309 -0.585
acceleration 0.423 -0.505 -0.544 -0.689 -0.417 1.000 0.290 0.213
year 0.581 -0.346 -0.370 -0.416 -0.309 0.290 1.000 0.182
origin 0.565 -0.569 -0.615 -0.455 -0.585 0.213 0.182 1.000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

lm.auto <- lm(mpg ~ cylinders +
                displacement +
                horsepower +
                weight +
                acceleration +
                year + 
                origin,
              data=auto)
summary(lm.auto)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

For instance: i. Is there a relationship between the predictors and the re- sponse?

Yes, based on over-all p-value of < 2.2 e -16

ii. Which predictors appear to have a statistically significant relationship to the response?

displacement, weight, year, origin

iii. What does the coefficient for the year variable suggest?

mpg increases ~ 0.75 per year when other predictors are held constant.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

plot(lm.auto)

Residuals vs fitted suggest there may be some non-linearity since there is a u-shaped curve.

QQ plot: a few more high residuals than expected.

There is a high leverage point but its residual is not that big so no serious problems there.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Only use those predictors that were significant in the MR. Only fit two way interactions

lm.auto.int <- lm(mpg ~ 
                displacement +
                weight +
                year + 
                origin +
                  displacement:weight +
                  displacement:year +
                  displacement:origin +
                  weight:year +
                  weight:origin +
                  year:origin,
              data=auto)
summary(lm.auto.int)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin + displacement:weight + 
##     displacement:year + displacement:origin + weight:year + weight:origin + 
##     year:origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8970 -1.5806 -0.1199  1.2215 14.1451 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.792e+01  2.496e+01  -0.718  0.47325    
## displacement         3.382e-02  8.295e-02   0.408  0.68370    
## weight              -8.284e-03  1.119e-02  -0.740  0.45970    
## year                 9.045e-01  3.237e-01   2.795  0.00546 ** 
## origin              -5.649e+00  5.352e+00  -1.055  0.29195    
## displacement:weight  1.806e-05  2.762e-06   6.540 1.98e-10 ***
## displacement:year   -1.593e-03  1.137e-03  -1.401  0.16189    
## displacement:origin  1.605e-02  1.276e-02   1.258  0.20930    
## weight:year          5.751e-06  1.512e-04   0.038  0.96968    
## weight:origin       -1.343e-03  9.465e-04  -1.418  0.15688    
## year:origin          9.457e-02  6.619e-02   1.429  0.15387    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.95 on 381 degrees of freedom
## Multiple R-squared:  0.8608, Adjusted R-squared:  0.8571 
## F-statistic: 235.6 on 10 and 381 DF,  p-value: < 2.2e-16
plot(lm.auto.int)

The non-linearity suggested from the fitted vs residuals is gone, but there may now be some heteroscedasctity. QQ plot shows some outliers.

(f) Try a few different transformations of the variables, such as log(X), √X, X2. Comment on your findings.

First lets look again at a pairs plot using the significant predictors from the MR:

auto %>% 
  select(mpg, displacement, weight, year, origin) %>%
  pairs()

Both displacement and weight show some non-linearity w.r.t. mpg

auto %>% select(mpg,displacement) %>%
  mutate(dis2 = displacement^2, 
         logdisp=log(displacement),
         sqrtdispl=sqrt(displacement)) %>%
  pairs()

auto %>% select(mpg,weight) %>%
  mutate(weight2 = weight^2, 
         logweight=log(weight),
         sqrtweight=sqrt(weight)) %>%
  pairs()

The log transformations appear to provide the best linearity with mpg

lm.auto.trans <- lm(mpg ~ 
                log(displacement) +
                log(weight) +
                year + 
                origin,
              data=auto)
summary(lm.auto.trans)
## 
## Call:
## lm(formula = mpg ~ log(displacement) + log(weight) + year + origin, 
##     data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.0612  -1.9499  -0.0303   1.6006  13.0971 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       112.42433    9.88747  11.370   <2e-16 ***
## log(displacement)  -0.51061    0.97706  -0.523   0.6016    
## log(weight)       -18.38713    1.70131 -10.808   <2e-16 ***
## year                0.77502    0.04569  16.962   <2e-16 ***
## origin              0.69822    0.26673   2.618   0.0092 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.126 on 387 degrees of freedom
## Multiple R-squared:  0.8412, Adjusted R-squared:  0.8395 
## F-statistic: 512.5 on 4 and 387 DF,  p-value: < 2.2e-16
plot(lm.auto.trans)

compaere this to unrransformed

lm.auto.small <- lm(mpg ~ 
                displacement +
                weight +
                year + 
                origin,
              data=auto)
summary(lm.auto.small)
## 
## Call:
## lm(formula = mpg ~ displacement + weight + year + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8102 -2.1129 -0.0388  1.7725 13.2085 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.861e+01  4.028e+00  -4.620 5.25e-06 ***
## displacement  5.588e-03  4.768e-03   1.172    0.242    
## weight       -6.575e-03  5.571e-04 -11.802  < 2e-16 ***
## year          7.714e-01  4.981e-02  15.486  < 2e-16 ***
## origin        1.226e+00  2.670e-01   4.593 5.92e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.346 on 387 degrees of freedom
## Multiple R-squared:  0.8181, Adjusted R-squared:  0.8162 
## F-statistic: 435.1 on 4 and 387 DF,  p-value: < 2.2e-16
plot(lm.auto.small)

The R-squared is slightly better using the transformed predictors.

Q10 This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

data("Carseats")
lm10a <- lm(Sales ~ Price + Urban + US,data = Carseats)
summary(lm10a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

  • Intercept: The predicted base level of saled if carseats were free
  • Price: Sales decreased by 54 units per increase in price (if other predicotrs held constant).
  • Urban: Are sales affected by urban vs rural store location? N.S.
  • US: Being in the US increases sales by 1200 units (when other factors are constant).

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\[ Sales \sim \beta_0 + \beta_1*Price + \beta_2*Urban + \beta_3*US + \epsilon \]

Or did they mean

\[ Sales = 13.04 - 0.05 * Price - 0.02*UrbanYes + 1.2*USYes \] (d) For which of the predictors can you reject the null hypothesis H0 :βj =0?

Intercept, price, US

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm10e <- update(lm10a, ~ . - Urban)
summary(lm10e)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Adjusted R-squared is a bit below 0.24 meaning there is a lot of unexplained variance.

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm10e)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

plot(lm10e)

nope, looks pretty good.

Also try some 3D plotting:

library(plot3D)
library(broom)
## 
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
## 
##     bootstrap
library(modelr)

Q 13

13. In this exercise you will create some simulated data and will fit simple linear regression models to it. Make sure to use set.seed(1) prior to starting part (a) to ensure consistent results.

(a) Using the rnorm() function, create a vector, x, containing 100 observations drawn from a N (0, 1) distribution. This represents a feature, X. (b) Using the rnorm() function, create a vector, eps, containing 100 observations drawn from a N(0,0.25) distribution i.e. a normal distribution with mean zero and variance 0.25. (c) Using x and eps, generate a vector y according to the model Y =−1+0.5X+ε. (3.39) What is the length of the vector y? What are the values of β0 and β1 in this linear model?

set.seed(1)
data13 <- tibble(
  x = rnorm(100,mean = 0, sd = 1),
  eps = rnorm(100,0, sqrt(0.25)),
  y = 1 + 0.5*x + eps)

data13
## # A tibble: 100 x 3
##             x         eps         y
##         <dbl>       <dbl>     <dbl>
##  1 -0.6264538 -0.31018334 0.3765898
##  2  0.1836433  0.02105794 1.1128796
##  3 -0.8356286 -0.45546082 0.1267249
##  4  1.5952808  0.07901439 1.8766548
##  5  0.3295078 -0.32729232 0.8374616
##  6 -0.8204684  0.88364363 1.4734094
##  7  0.4874291  0.35835374 1.6020683
##  8  0.7383247  0.45508711 1.8242495
##  9  0.5757814  0.19209268 1.4799834
## 10 -0.3053884  0.84108804 1.6883938
## # ... with 90 more rows

\[\beta_0 = 1\]

\[\beta_1=0.5\] length of y is 100

(d) Create a scatterplot displaying the relationship between x and y. Comment on what you observe.

data13 %>% qplot(x,y,data=.)

Pretty nice linear relationship, but some scatter…

(e) Fit a least squares linear model to predict y using x. Comment on the model obtained. How do βˆ0 and βˆ1 compare to β0 and β1?

lm13 <- lm(y~x, data=data13)
summary(lm13)
## 
## Call:
## lm(formula = y ~ x, data = data13)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.98115    0.04849  20.233  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15
confint(lm13)
##                 2.5 %    97.5 %
## (Intercept) 0.8849196 1.0773878
## x           0.3925794 0.6063602

The estimates are near the real values and the 95% confidence intervals include the real values

(f) Display the least squares line on the scatterplot obtained in (d). Draw the population regression line on the plot, in a different color. Use the legend() command to create an appropriate legend.

data13 %>% ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_smooth(aes(color="blue"),method="lm",se=FALSE) +
  geom_abline(aes(intercept=1,slope=.5,color="red")) +
  scale_color_manual(name="legend",values=c("red"="red","blue"="blue"),labels=c("population","sample"), guide="legend")

(g) Now fit a polynomial regression model that predicts y using x and x2. Is there evidence that the quadratic term improves the model fit? Explain your answer.

lm13b <- lm(y ~ x + I(x^2), data=data13)
summary(lm13b)
## 
## Call:
## lm(formula = y ~ x + I(x^2), data = data13)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.02836    0.05883  17.481  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14
confint(lm13b)
##                  2.5 %     97.5 %
## (Intercept)  0.9116007 1.14511430
## x            0.4014226 0.61573832
## I(x^2)      -0.1435788 0.02465763

The fits are pretty similar, so no real improvement. RSE and R-squeared similar.

(h) Repeat (a)–(f) after modifying the data generation process in such a way that there is less noise in the data. The model (3.39) should remain the same. You can do this by decreasing the variance of the normal distribution used to generate the error term ε in (b). Describe your results.

data13h <- tibble(
  x = rnorm(100,mean = 0, sd = 1),
  eps = rnorm(100,0, sqrt(0.1)),
  y = 1 + 0.5*x + eps)

data13h
## # A tibble: 100 x 3
##              x         eps         y
##          <dbl>       <dbl>     <dbl>
##  1  0.40940184  0.28260444 1.4873054
##  2  1.68887329 -0.33118475 1.5132519
##  3  1.58658843  0.62339162 2.4166858
##  4 -0.33090780 -0.12131512 0.7132310
##  5 -2.28523554  0.52308667 0.3804689
##  6  2.49766159  0.47820364 2.7270344
##  7  0.66706617  0.02623607 1.3597692
##  8  0.54132734  0.17937100 1.4500347
##  9 -0.01339952 -0.32399068 0.6693096
## 10  0.51010842  0.10214362 1.3571978
## # ... with 90 more rows
lm13h <- lm(y ~ x, data=data13h)
summary(lm13h)
## 
## Call:
## lm(formula = y ~ x, data = data13h)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86703 -0.17753 -0.00553  0.21495  0.58452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.01532    0.03134   32.40   <2e-16 ***
## x            0.53359    0.03044   17.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3133 on 98 degrees of freedom
## Multiple R-squared:  0.7582, Adjusted R-squared:  0.7557 
## F-statistic: 307.3 on 1 and 98 DF,  p-value: < 2.2e-16
confint(lm13h)
##                 2.5 %    97.5 %
## (Intercept) 0.9531317 1.0775107
## x           0.4731823 0.5939962
data13h %>%
  ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_smooth(aes(color="blue"),method="lm",se=FALSE) +
  geom_abline(aes(intercept=1,slope=.5,color="red")) +
  scale_color_manual(name="legend",values=c("red"="red","blue"="blue"),labels=c("population","sample"), guide="legend")

smaller confidence intervals, lower RSE, higher R^2

(i) Repeat (a)–(f) after modifying the data generation process in such a way that there is more noise in the data. The model (3.39) should remain the same. You can do this by increasing the variance of the normal distribution used to generate the error term ε in (b). Describe your results.

data13i <- tibble(
  x = rnorm(100,mean = 0, sd = 1),
  eps = rnorm(100,0, sqrt(0.5)),
  y = 1 + 0.5*x + eps)

data13h
## # A tibble: 100 x 3
##              x         eps         y
##          <dbl>       <dbl>     <dbl>
##  1  0.40940184  0.28260444 1.4873054
##  2  1.68887329 -0.33118475 1.5132519
##  3  1.58658843  0.62339162 2.4166858
##  4 -0.33090780 -0.12131512 0.7132310
##  5 -2.28523554  0.52308667 0.3804689
##  6  2.49766159  0.47820364 2.7270344
##  7  0.66706617  0.02623607 1.3597692
##  8  0.54132734  0.17937100 1.4500347
##  9 -0.01339952 -0.32399068 0.6693096
## 10  0.51010842  0.10214362 1.3571978
## # ... with 90 more rows
lm13i <- lm(y ~ x, data=data13i)
summary(lm13i)
## 
## Call:
## lm(formula = y ~ x, data = data13i)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7749 -0.4281  0.0146  0.4984  1.4777 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.96645    0.06842  14.125  < 2e-16 ***
## x            0.44700    0.05876   7.607 1.73e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6838 on 98 degrees of freedom
## Multiple R-squared:  0.3713, Adjusted R-squared:  0.3648 
## F-statistic: 57.87 on 1 and 98 DF,  p-value: 1.731e-11
confint(lm13i)
##                 2.5 %    97.5 %
## (Intercept) 0.8306640 1.1022281
## x           0.3303927 0.5636136
data13i %>%
  ggplot(aes(x=x,y=y)) +
  geom_point() +
  geom_smooth(aes(color="blue"),method="lm",se=FALSE) +
  geom_abline(aes(intercept=1,slope=.5,color="red")) +
  scale_color_manual(name="legend",values=c("red"="red","blue"="blue"),labels=c("population","sample"), guide="legend")

(j) What are the confidence intervals for β0 and β1 based on the original data set, the noisier data set, and the less noisy data set? Comment on your results.

Already discussed above in part, but…

map(list(original=lm13,lower_var=lm13h,higher_var=lm13i),confint)
## $original
##                 2.5 %    97.5 %
## (Intercept) 0.8849196 1.0773878
## x           0.3925794 0.6063602
## 
## $lower_var
##                 2.5 %    97.5 %
## (Intercept) 0.9531317 1.0775107
## x           0.4731823 0.5939962
## 
## $higher_var
##                 2.5 %    97.5 %
## (Intercept) 0.8306640 1.1022281
## x           0.3303927 0.5636136
map(list(original=lm13,lower_var=lm13h,higher_var=lm13i),confint) %>%
  map(as.data.frame) %>%
  map(rownames_to_column,var="coefficient") %>%
  bind_rows(.id="data_source") %>%
  ggplot(aes(x=data_source,ymin=`2.5 %`,ymax=`97.5 %`,color=coefficient)) +
  geom_linerange(lwd=3) + 
  ggtitle("95% Confidence Intervals") +
  geom_hline(yintercept = c(0.5,1))

Q14. This problem focuses on the collinearity problem.

(a) Perform the following commands in R:

set.seed(1)
x1=runif(100)
x2=0.5*x1+rnorm(100)/10
y=2+2*x1+0.3*x2+rnorm(100)

The last line corresponds to creating a linear model in which y is a function of x1 and x2. Write out the form of the linear model. What are the regression coefficients?

\[ y = \beta_0 + \beta_1 * x1 + \beta_2 * x2 + \epsilon \] \[ where \beta_0=2, \beta_1=2, and \beta_2=0.3 \] (b) What is the correlation between x1 and x2? Create a scatterplot displaying the relationship between the variables.

cor(x1,x2)
## [1] 0.8351212
plot(x1,x2)

(c) Using this data, fit a least squares regression to predict y using x1 and x2. Describe the results obtained. What are βˆ0, βˆ1, and βˆ2? How do these relate to the true β0, β1, and β2? Can you reject the null hypothesis H0 : β1 = 0? How about the null hypothesis H0 : β2 = 0?

lm14c <- lm(y ~ x1 + x2)
summary(lm14c)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

\(\beta_1\) is underestimated and \(\beta_2\) is overestimated. Overall poor fit based on R-squared. \(\beta_1\) significant but not \(\beta_2\)

(d) Now fit a least squares regression to predict y using only x1. Comment on your results. Can you reject the null hypothesis H0 :β1 =0?

lm14d <- lm(y ~ x1)
summary(lm14d)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

R2 is the same, but now \(\beta_1\) is estimated much better.

(e) Now fit a least squares regression to predict y using only x2. Comment on your results. Can you reject the null hypothesis H0 :β1 =0?

lm14e <- lm(y ~ x2)
summary(lm14e)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

R2 is less; \(\beta_2\) now even more overestimated, but it is significant.

(f) Do the results obtained in (c)–(e) contradict each other? Explain your answer.

Yes, in that \(\beta_2\) is not significant in c but it is in e. This is because of the co-linearity.

(g) Now suppose we obtain one additional observation, which was unfortunately mismeasured.

x1=c(x1, 0.1)
x2=c(x2, 0.8)
y=c(y,6)

Re-fit the linear models from (c) to (e) using this new data. What effect does this new observation have on the each of the models? In each model, is this observation an outlier? A high-leverage point? Both? Explain your answers.

lm14c2 <- lm(y ~ x1 + x2)
summary(lm14c2)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
plot(lm14c2)

lm14d2 <- lm(y ~ x1)
summary(lm14d2)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
plot(lm14d2)

lm14e2 <- lm(y ~ x2)
summary(lm14e2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
plot(lm14e2)

This point is a high-leverage outlier. It flips which coefficient is significant in the MR

15a

15. This problem involves the Boston data set, which we saw in the lab for this chapter. We will now try to predict per capita crime rate using the other variables in this data set. In other words, per capita crime rate is the response, and the other variables are the predictors.

(a) For each predictor, fit a simple linear regression model to predict the response. Describe your results. In which of the models is there a statistically significant association between the predictor and the response? Create some plots to back up your assertions.

data(Boston)
boston <- as_tibble(Boston)
boston
## # A tibble: 506 x 14
##       crim    zn indus  chas   nox    rm   age    dis   rad   tax ptratio
##  *   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl> <int> <dbl>   <dbl>
##  1 0.00632  18.0  2.31     0 0.538 6.575  65.2 4.0900     1   296    15.3
##  2 0.02731   0.0  7.07     0 0.469 6.421  78.9 4.9671     2   242    17.8
##  3 0.02729   0.0  7.07     0 0.469 7.185  61.1 4.9671     2   242    17.8
##  4 0.03237   0.0  2.18     0 0.458 6.998  45.8 6.0622     3   222    18.7
##  5 0.06905   0.0  2.18     0 0.458 7.147  54.2 6.0622     3   222    18.7
##  6 0.02985   0.0  2.18     0 0.458 6.430  58.7 6.0622     3   222    18.7
##  7 0.08829  12.5  7.87     0 0.524 6.012  66.6 5.5605     5   311    15.2
##  8 0.14455  12.5  7.87     0 0.524 6.172  96.1 5.9505     5   311    15.2
##  9 0.21124  12.5  7.87     0 0.524 5.631 100.0 6.0821     5   311    15.2
## 10 0.17004  12.5  7.87     0 0.524 6.004  85.9 6.5921     5   311    15.2
## # ... with 496 more rows, and 3 more variables: black <dbl>, lstat <dbl>,
## #   medv <dbl>
predictors <- colnames(boston)[-1]
lmfits <- map(predictors,function(x) lm(crim ~ get(x), data=boston))
lmsummaries <- lapply(lmfits,summary)
names(lmsummaries) <- predictors
lmsummaries
## $zn
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## get(x)      -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06
## 
## 
## $indus
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## get(x)       0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $chas
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## get(x)       -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
## 
## 
## $nox
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## get(x)        31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $rm
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## get(x)        -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07
## 
## 
## $age
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## get(x)       0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16
## 
## 
## $dis
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## get(x)       -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $rad
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## get(x)       0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $tax
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## get(x)       0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $ptratio
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## get(x)        1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11
## 
## 
## $black
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.553529   1.425903  11.609   <2e-16 ***
## get(x)      -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $lstat
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## get(x)       0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## $medv
## 
## Call:
## lm(formula = crim ~ get(x), data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## get(x)      -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16
sapply(lmsummaries, function(x) x$coefficients["get(x)","Pr(>|t|)"])
##           zn        indus         chas          nox           rm 
## 5.506472e-06 1.450349e-21 2.094345e-01 3.751739e-23 6.346703e-07 
##          age          dis          rad          tax      ptratio 
## 2.854869e-16 8.519949e-19 2.693844e-56 2.357127e-47 2.942922e-11 
##        black        lstat         medv 
## 2.487274e-19 2.654277e-27 1.173987e-19
sapply(lmsummaries, function(x) x$adj.r.squared)
##         zn      indus       chas        nox         rm        age 
## 0.03828352 0.16365394 0.00114594 0.17558468 0.04618036 0.12268419 
##        dis        rad        tax    ptratio      black      lstat 
## 0.14245126 0.39004886 0.33830395 0.08225111 0.14658431 0.20601869 
##       medv 
## 0.14909551

OR

(see http://r4ds.had.co.nz/many-models.html)

create a nested data object, one data frame for each predictor

library(modelr)
boston.nest <- boston %>% 
  gather(key="predictor",value="value",-crim) %>%
  group_by(predictor) %>%
  nest()
boston.nest # a set of data frames, one for each predictor
## # A tibble: 13 x 2
##    predictor               data
##        <chr>             <list>
##  1        zn <tibble [506 x 2]>
##  2     indus <tibble [506 x 2]>
##  3      chas <tibble [506 x 2]>
##  4       nox <tibble [506 x 2]>
##  5        rm <tibble [506 x 2]>
##  6       age <tibble [506 x 2]>
##  7       dis <tibble [506 x 2]>
##  8       rad <tibble [506 x 2]>
##  9       tax <tibble [506 x 2]>
## 10   ptratio <tibble [506 x 2]>
## 11     black <tibble [506 x 2]>
## 12     lstat <tibble [506 x 2]>
## 13      medv <tibble [506 x 2]>
boston.nest$data[[1]]
## # A tibble: 506 x 2
##       crim value
##      <dbl> <dbl>
##  1 0.00632  18.0
##  2 0.02731   0.0
##  3 0.02729   0.0
##  4 0.03237   0.0
##  5 0.06905   0.0
##  6 0.02985   0.0
##  7 0.08829  12.5
##  8 0.14455  12.5
##  9 0.21124  12.5
## 10 0.17004  12.5
## # ... with 496 more rows

Fit a model to each dataframe

fitModel <- function(df) lm(crim ~ value, data=df)
boston.nest <- boston.nest%>%
  mutate(model=map(data,fitModel),
         model.summary=map(model,summary))
boston.nest
## # A tibble: 13 x 4
##    predictor               data    model    model.summary
##        <chr>             <list>   <list>           <list>
##  1        zn <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  2     indus <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  3      chas <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  4       nox <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  5        rm <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  6       age <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  7       dis <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  8       rad <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
##  9       tax <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
## 10   ptratio <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
## 11     black <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
## 12     lstat <tibble [506 x 2]> <S3: lm> <S3: summary.lm>
## 13      medv <tibble [506 x 2]> <S3: lm> <S3: summary.lm>

Now, use broom::glance to get summaries

library(broom)
boston_glance <- boston.nest %>%
  mutate(glance=map(model,glance)) %>%
  unnest(glance,.drop = TRUE)
boston_glance
## # A tibble: 13 x 12
##    predictor   r.squared adj.r.squared    sigma  statistic      p.value
##        <chr>       <dbl>         <dbl>    <dbl>      <dbl>        <dbl>
##  1        zn 0.040187908    0.03828352 8.435290  21.102782 5.506472e-06
##  2     indus 0.165310070    0.16365394 7.866281  99.817037 1.450349e-21
##  3      chas 0.003123869    0.00114594 8.596615   1.579364 2.094345e-01
##  4       nox 0.177217182    0.17558468 7.809972 108.555329 3.751739e-23
##  5        rm 0.048069117    0.04618036 8.400586  25.450204 6.346703e-07
##  6       age 0.124421452    0.12268419 8.056649  71.619402 2.854869e-16
##  7       dis 0.144149375    0.14245126 7.965369  84.887810 8.519949e-19
##  8       rad 0.391256687    0.39004886 6.717752 323.935172 2.693844e-56
##  9       tax 0.339614243    0.33830395 6.996901 259.190294 2.357127e-47
## 10   ptratio 0.084068439    0.08225111 8.240212  46.259453 2.942922e-11
## 11     black 0.148274239    0.14658431 7.946150  87.739763 2.487274e-19
## 12     lstat 0.207590933    0.20601869 7.664461 132.035125 2.654277e-27
## 13      medv 0.150780469    0.14909551 7.934451  89.486115 1.173987e-19
## # ... with 6 more variables: df <int>, logLik <dbl>, AIC <dbl>, BIC <dbl>,
## #   deviance <dbl>, df.residual <int>

old way…

print out the summaries (but no names!)

boston.nest$model.summary
## [[1]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.429 -4.222 -2.620  1.250 84.523 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.45369    0.41722  10.675  < 2e-16 ***
## value       -0.07393    0.01609  -4.594 5.51e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.435 on 504 degrees of freedom
## Multiple R-squared:  0.04019,    Adjusted R-squared:  0.03828 
## F-statistic:  21.1 on 1 and 504 DF,  p-value: 5.506e-06
## 
## 
## [[2]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.972  -2.698  -0.736   0.712  81.813 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.06374    0.66723  -3.093  0.00209 ** 
## value        0.50978    0.05102   9.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.866 on 504 degrees of freedom
## Multiple R-squared:  0.1653, Adjusted R-squared:  0.1637 
## F-statistic: 99.82 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[3]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.738 -3.661 -3.435  0.018 85.232 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.7444     0.3961   9.453   <2e-16 ***
## value        -1.8928     1.5061  -1.257    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.597 on 504 degrees of freedom
## Multiple R-squared:  0.003124,   Adjusted R-squared:  0.001146 
## F-statistic: 1.579 on 1 and 504 DF,  p-value: 0.2094
## 
## 
## [[4]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.371  -2.738  -0.974   0.559  81.728 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -13.720      1.699  -8.073 5.08e-15 ***
## value         31.249      2.999  10.419  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.81 on 504 degrees of freedom
## Multiple R-squared:  0.1772, Adjusted R-squared:  0.1756 
## F-statistic: 108.6 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[5]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.604 -3.952 -2.654  0.989 87.197 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   20.482      3.365   6.088 2.27e-09 ***
## value         -2.684      0.532  -5.045 6.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.401 on 504 degrees of freedom
## Multiple R-squared:  0.04807,    Adjusted R-squared:  0.04618 
## F-statistic: 25.45 on 1 and 504 DF,  p-value: 6.347e-07
## 
## 
## [[6]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.789 -4.257 -1.230  1.527 82.849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.77791    0.94398  -4.002 7.22e-05 ***
## value        0.10779    0.01274   8.463 2.85e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.057 on 504 degrees of freedom
## Multiple R-squared:  0.1244, Adjusted R-squared:  0.1227 
## F-statistic: 71.62 on 1 and 504 DF,  p-value: 2.855e-16
## 
## 
## [[7]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.708 -4.134 -1.527  1.516 81.674 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.4993     0.7304  13.006   <2e-16 ***
## value        -1.5509     0.1683  -9.213   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.965 on 504 degrees of freedom
## Multiple R-squared:  0.1441, Adjusted R-squared:  0.1425 
## F-statistic: 84.89 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[8]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.164  -1.381  -0.141   0.660  76.433 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.28716    0.44348  -5.157 3.61e-07 ***
## value        0.61791    0.03433  17.998  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.718 on 504 degrees of freedom
## Multiple R-squared:  0.3913, Adjusted R-squared:   0.39 
## F-statistic: 323.9 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[9]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.513  -2.738  -0.194   1.065  77.696 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.528369   0.815809  -10.45   <2e-16 ***
## value        0.029742   0.001847   16.10   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.997 on 504 degrees of freedom
## Multiple R-squared:  0.3396, Adjusted R-squared:  0.3383 
## F-statistic: 259.2 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[10]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.654 -3.985 -1.912  1.825 83.353 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.6469     3.1473  -5.607 3.40e-08 ***
## value         1.1520     0.1694   6.801 2.94e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.24 on 504 degrees of freedom
## Multiple R-squared:  0.08407,    Adjusted R-squared:  0.08225 
## F-statistic: 46.26 on 1 and 504 DF,  p-value: 2.943e-11
## 
## 
## [[11]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.756  -2.299  -2.095  -1.296  86.822 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 16.553529   1.425903  11.609   <2e-16 ***
## value       -0.036280   0.003873  -9.367   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.946 on 504 degrees of freedom
## Multiple R-squared:  0.1483, Adjusted R-squared:  0.1466 
## F-statistic: 87.74 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[12]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.925  -2.822  -0.664   1.079  82.862 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.33054    0.69376  -4.801 2.09e-06 ***
## value        0.54880    0.04776  11.491  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.664 on 504 degrees of freedom
## Multiple R-squared:  0.2076, Adjusted R-squared:  0.206 
## F-statistic:   132 on 1 and 504 DF,  p-value: < 2.2e-16
## 
## 
## [[13]]
## 
## Call:
## lm(formula = crim ~ value, data = df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.071 -4.022 -2.343  1.298 80.957 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.79654    0.93419   12.63   <2e-16 ***
## value       -0.36316    0.03839   -9.46   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.934 on 504 degrees of freedom
## Multiple R-squared:  0.1508, Adjusted R-squared:  0.1491 
## F-statistic: 89.49 on 1 and 504 DF,  p-value: < 2.2e-16

We can add columns for whatever we want to extract from the summaries….

boston.nest.sum <- boston.nest %>%
  mutate(p_value=map_dbl(model.summary,function(x) x$coefficients["value","Pr(>|t|)"]),
         adj.r.squared=map_dbl(model.summary, function(x) x$adj.r.squared))
boston.nest.sum
## # A tibble: 13 x 6
##    predictor               data    model    model.summary      p_value
##        <chr>             <list>   <list>           <list>        <dbl>
##  1        zn <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 5.506472e-06
##  2     indus <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 1.450349e-21
##  3      chas <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.094345e-01
##  4       nox <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 3.751739e-23
##  5        rm <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 6.346703e-07
##  6       age <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.854869e-16
##  7       dis <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 8.519949e-19
##  8       rad <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.693844e-56
##  9       tax <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.357127e-47
## 10   ptratio <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.942922e-11
## 11     black <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.487274e-19
## 12     lstat <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 2.654277e-27
## 13      medv <tibble [506 x 2]> <S3: lm> <S3: summary.lm> 1.173987e-19
## # ... with 1 more variables: adj.r.squared <dbl>
  1. Fit a multiple regression model to predict the response using all of the predictors. Describe your results. For which predictors can we reject the null hypothesis H0 : βj = 0?
library(stringr)
fo <- as.formula(str_c("crim ~ ", str_c(predictors, collapse = " + ")))
lm15b <- lm(fo,data = boston)
summary(lm15b)
## 
## Call:
## lm(formula = fo, data = boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.924 -2.120 -0.353  1.019 75.051 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17.033228   7.234903   2.354 0.018949 *  
## zn            0.044855   0.018734   2.394 0.017025 *  
## indus        -0.063855   0.083407  -0.766 0.444294    
## chas         -0.749134   1.180147  -0.635 0.525867    
## nox         -10.313535   5.275536  -1.955 0.051152 .  
## rm            0.430131   0.612830   0.702 0.483089    
## age           0.001452   0.017925   0.081 0.935488    
## dis          -0.987176   0.281817  -3.503 0.000502 ***
## rad           0.588209   0.088049   6.680 6.46e-11 ***
## tax          -0.003780   0.005156  -0.733 0.463793    
## ptratio      -0.271081   0.186450  -1.454 0.146611    
## black        -0.007538   0.003673  -2.052 0.040702 *  
## lstat         0.126211   0.075725   1.667 0.096208 .  
## medv         -0.198887   0.060516  -3.287 0.001087 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.439 on 492 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4396 
## F-statistic: 31.47 on 13 and 492 DF,  p-value: < 2.2e-16

This model explains about 44% of the variance, with zn, dis, rad, black, and medv being significant predictors

(c) How do your results from (a) compare to your results from (b)? Create a plot displaying the univariate regression coefficients from (a) on the x-axis, and the multiple regression coefficients from (b) on the y-axis. That is, each predictor is displayed as a single point in the plot. Its coefficient in a simple linear regres- sion model is shown on the x-axis, and its coefficient estimate in the multiple linear regression model is shown on the y-axis.

get the simple model regression coefficients

coefs <- boston.nest %>%
  mutate(coefficients=map(model,tidy)) %>%
  unnest(coefficients,.drop=TRUE) %>%
  filter(term=="value") %>%
  dplyr::select(predictor,simple.est=estimate)
coefs
## # A tibble: 13 x 2
##    predictor  simple.est
##        <chr>       <dbl>
##  1        zn -0.07393498
##  2     indus  0.50977633
##  3      chas -1.89277655
##  4       nox 31.24853120
##  5        rm -2.68405122
##  6       age  0.10778623
##  7       dis -1.55090168
##  8       rad  0.61791093
##  9       tax  0.02974225
## 10   ptratio  1.15198279
## 11     black -0.03627964
## 12     lstat  0.54880478
## 13      medv -0.36315992

add the multiple regression coefficients

coefs <- as.data.frame(coef(lm15b)) %>%
  rownames_to_column(var="predictor") %>%
  rename(multiple.est=`coef(lm15b)`) %>%
  right_join(coefs)
## Joining, by = "predictor"

plot it

coefs %>% 
  ggplot(aes(x=simple.est,y=multiple.est,label=predictor)) +
  geom_point() +
  geom_text(nudge_x = 1)

Whoa, radically different!

  1. Is there evidence of non-linear association between any of the predictors and the response? To answer this question, for each predictor X, fit a model of the form Y = β0 +β1X +β2X2 +β3X3 +ε.
fitModelNL <- function(df) lm(crim ~ value + I(value^2) + I(value^3), data=df)
boston.nest <- boston.nest %>%
  mutate(modelNL=map(data,fitModelNL))
boston.nest  
## # A tibble: 13 x 5
##    predictor               data    model    model.summary  modelNL
##        <chr>             <list>   <list>           <list>   <list>
##  1        zn <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  2     indus <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  3      chas <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  4       nox <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  5        rm <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  6       age <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  7       dis <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  8       rad <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
##  9       tax <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
## 10   ptratio <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
## 11     black <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
## 12     lstat <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
## 13      medv <tibble [506 x 2]> <S3: lm> <S3: summary.lm> <S3: lm>
boston_glanceNL <- boston.nest %>%
  mutate(glanceNL=map(modelNL,glance)) %>%
  unnest(glanceNL,.drop = TRUE) %>%
right_join(boston_glance,by="predictor",suffix=c(".poly",".linear"))

boston_glanceNL %>%
  ggplot(aes(x=adj.r.squared.linear,y=adj.r.squared.poly,label=predictor)) +
  geom_point() +
  geom_text(nudge_x = .015) +
  geom_abline(slope = 1, intercept = 0)

Improved fit for some.